Conversation

@lbdreyer (Member) commented Mar 9, 2017

This reuses the code that recursively builds a multidimensional stacked dask array in two places.
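For context, a minimal sketch of what such a recursive stacking helper might look like. The real Iris implementation isn't reproduced in this thread; the shape of this sketch just follows the "3 code branches" and one-dimension-at-a-time recursion described later in the review:

import dask.array as da

def multidim_lazy_stack(stack):
    # 'stack' is an ndarray of dask arrays; build one dask array from it.
    if stack.ndim == 0:
        # 0-d object array: unwrap the single dask array it contains.
        result = stack.item()
    elif stack.ndim == 1:
        # 1-d object array: stack its dask-array elements directly.
        result = da.stack(list(stack))
    else:
        # Peel off one dimension and recurse over the sub-stacks.
        result = da.stack([multidim_lazy_stack(subarray)
                           for subarray in stack])
    return result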

@lbdreyer lbdreyer added the dask label Mar 9, 2017
@lbdreyer lbdreyer changed the title Reuse multidim_daskstack in merge in fast um loading. Reuse multidim_daskstack in merge + fast um loading. Mar 9, 2017
@lbdreyer (Member Author) commented Mar 9, 2017

@dkillick @pp-mo @bjlittle

Args:
* data:
Member

Your input argument is called "stack", not "data".

self._data_cache = [da.stack(self._data_cache[i:i+size]) for i
                    in range(0, len(self._data_cache), size)]
self._data_cache, = self._data_cache
data_arrays = np.array([f._data for f in self.fields],
Member

I think you need to convert each f._data into a dask array here. This isn't done by multidim_daskstack, unless calling da.stack converts the inputs into dask arrays and then stacks them? Either way, the conversion to arrays is not being done by the common API from #2421, which it probably should be.
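A hypothetical sketch of doing that conversion up front, before building the object array (assuming each f._data is a realised NumPy array; the single whole-array chunk is just illustrative):

# Wrap each field payload as a dask array first, since the object
# array passed to the stacking helper is expected to contain dask
# arrays (da is dask.array, np is numpy).
data_arrays = np.array([da.from_array(f._data, chunks=f._data.shape)
                        for f in self.fields],
                       dtype=object)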

class Test_multidim_daskstack(tests.IrisTest):
    def test_0d(self):
        value = 4
        data = np.array(da.from_array(np.array(value), chunks=1), dtype=object)
Member

These from_array calls will need to be replaced after #2421 has been merged...

vals = [4, 8, 11]
data = np.array([da.from_array(val*np.ones((2, 2)), chunks=2) for val
                 in vals], dtype=object)
data = data.reshape(3, 1, 2, 2)
Member

I'd argue this data is no longer 2D! Perhaps you could rename this test to test_nd?

# You should have received a copy of the GNU Lesser General Public License
# along with Iris. If not, see <http://www.gnu.org/licenses/>.
"""Test :meth:`iris._lazy data.array_masked_to_nans` method."""
"""Test function :func:`iris._lazy data.array_masked_to_nans`."""
Member

👍

# You should have received a copy of the GNU Lesser General Public License
# along with Iris. If not, see <http://www.gnu.org/licenses/>.
"""Test :meth:`iris._lazy data.is_lazy_data` method."""
"""Test function :func:`iris._lazy data.is_lazy_data`."""
Member

👍

self.assertArrayEqual(result[:, :, 0, 0], np.array(vals).reshape(3, 1))


if __name__ == '__main__':
Member

Given my comment above about whether you have NumPy arrays or dask arrays at the end of this process, you could consider adding a test to check what happens when you pass in NumPy arrays versus dask arrays.

Member

Actually, I don't think we need to worry about what happens if you pass the routine numpy arrays (i.e. a numpy object array containing numpy arrays).
In intended use the argument is as stated "an ndarray of dask arrays", so we should just test that.
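A sketch of such a test, exercising exactly the documented input — a 1-d ndarray of dask arrays (the test name and shapes here are made up for illustration):

def test_1d_stack_of_dask_arrays(self):
    # The documented input: an ndarray whose elements are dask arrays.
    stack = np.empty((3,), dtype=object)
    for index in range(3):
        stack[index] = da.from_array(index * np.ones((4, 5)),
                                     chunks=(4, 5))
    result = multidim_lazy_stack(stack)
    self.assertEqual(result.shape, (3, 4, 5))
    self.assertArrayEqual(result[:, 0, 0].compute(), np.arange(3))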

@lbdreyer lbdreyer force-pushed the multi_dim_stack_dask branch from 123e252 to 975105b Compare March 9, 2017 18:10
@lbdreyer lbdreyer force-pushed the multi_dim_stack_dask branch from 048c521 to 717e7cf Compare March 10, 2017 14:06
@lbdreyer (Member Author)

@dkillick @pp-mo the changes based on your reviews are now up!

@DPeterK (Member) left a comment

Just a couple more comments from me but it's looking good! 🌴 👍

def _check(self, stack_shape):
    vals = np.arange(np.prod(stack_shape)).reshape(stack_shape)
    stack = np.empty(stack_shape, 'object')
    stack_element_shape = (4, 5)
Member

These magic numbers could probably do with a little explanation...

Member Author

Not magic numbers. I just chose them because they were small enough that I could produce a 4D result (for test_2d_lazy_stack, of shape (3, 2, 4, 5)) in which no dimension length is repeated.

It's equivalent to a pp field's data of shape (4, 5).

Member

Technically they are, as they're numbers that appear without much introduction! A comment on the line above saying "equivalent to a pp field's data of shape (4, 5)" would nicely sort this and improve the code's understandability, I think.

@lbdreyer (Member Author) Mar 10, 2017

It's possibly better to think of the numbers as "test input": just as, when you create an array to test an array operation, that array is thought of as "test input data".

In these tests, I am creating a stack of val*np.ones((4, 5)) arrays, where the shape of each element in the stack is (4, 5). So there's nothing special about the numbers; I just need something to create an array, and then I reuse those numbers to check I'm getting the right output shape.

My worry with the comment "equivalent to a pp field's data of shape (4, 5)" is that it describes just one application of multidim_lazy_stack.

I would consider renaming the variable, though. Would stack_element_dask_array_shape be clearer?

shape = (2,)
result = self._check(shape)

def test_2d_lazy_stack(self):
Member

Is there any chance of a >2D test?

Member Author

It seems a bit unnecessary? The 2d test is already testing the recursion.

Member

Indeed, however part of testing is checking edge cases, and I don't know what would happen if I passed a 7D array to it. Evidently we don't need to test all possible dimensionalities (!), but it would be good to test something that's outside of the boundary of logical intent for the functionality being tested. This is just in case an unsafe assumption has been made about how the function will be used; "of course, no-one would ever use this for more than a 2D input".

Member

I'm not hung up on this though, so I'm not going to make a right fuss if it's left out. @bjlittle @pp-mo do you have any inputs to this?

@pp-mo (Member) Mar 10, 2017

From a mockist / white-box-y point of view, the existing tests do already cover all 3 code branches.
The implementation relies on iteration over the passed stack to deconstruct the input one dimension at a time, but 2d is already checking that.
One more would do no harm, I suppose, but I don't really expect it to go wrong!
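If that extra test were wanted, a minimal sketch reusing the _check helper from above (the test name and shape are hypothetical):

def test_3d_lazy_stack(self):
    # One dimensionality beyond the existing cases, to exercise an
    # extra level of recursion; all dimension lengths kept distinct.
    shape = (3, 2, 6)
    result = self._check(shape)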

self.vector_dims_shape + data_arrays.shape[1:])
self._data_cache = multidim_daskstack(data_arrays)
stack = np.empty(np.prod(self.vector_dims_shape), 'object')
for index, f in enumerate(self.fields):
Member

Haha, I so would have gone for "i, field" here...

Member Author

Good point. I used f as that was what was used in the previous code. I will change it to
for index, field in ...

@DPeterK (Member) commented Mar 10, 2017

@bjlittle @pp-mo do you have any comments to add? Otherwise I'm minded to merge this when Travis is happy.

self._data_cache = [da.stack(self._data_cache[i:i+size]) for i
                    in range(0, len(self._data_cache), size)]
self._data_cache, = self._data_cache
stack = np.empty(np.prod(self.vector_dims_shape), 'object')
Member

I think it would be neater + clearer to iterate with a multidimensional index here, thus avoiding the np.prod(shape) and the subsequent stack.reshape.
Your magic friend for that is np.ndindex(shape), which gives you indices to all elements of a multidimensional array, as used here for example
I think that gets you something like ...

stack = np.empty(self.vector_dims_shape, 'object')
for field, nd_index in zip(fields, np.ndindex(self.vector_dims_shape)):
    stack[nd_index] = as_lazy_data(field._data, chunks=field._data.shape)
self._data_cache = multidim_lazy_stack(stack)
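For reference, np.ndindex yields index tuples in C (row-major) order, which is what the zip with self.fields relies on:

>>> import numpy as np
>>> list(np.ndindex((2, 3)))
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]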

for index, val in np.ndenumerate(vals):
    stack[index] = as_lazy_data(val*np.ones(stack_element_shape))
result = multidim_lazy_stack(stack)
self.assertEqual(result.shape, stack_shape + stack_element_shape)
Member

I wonder if we could also check the actual values, to ensure that the ordering of elements is definitely correct?
I think all that is needed is something like...

expected = np.empty(list(stack_shape)+list(stack_element_shape), dtype=int)
for index, val in np.ndenumerate(vals):
    stack[index] = as_lazy_data(val*np.ones(stack_element_shape))
    expected[index] = val
...
self.assertArrayAllClose(result.compute(), expected)

@lbdreyer (Member Author) Mar 13, 2017

"ensure that the ordering of elements is definitely correct"

Surely that's already achieved by using dimensions of different lengths?

Not that I'm against adding extra assurance.
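For what it's worth, distinct dimension lengths catch shape and axis mix-ups, but not a pure reordering of elements along an axis. A contrived illustration (da is dask.array, np is numpy, vals as in the tests above):

vals = [4, 8, 11]
stack = np.empty((3,), dtype=object)
for i, val in enumerate(vals):
    stack[i] = da.from_array(val * np.ones((4, 5)), chunks=(4, 5))
reversed_result = da.stack(list(stack[::-1]))
# The shape is still the expected (3, 4, 5)...
assert reversed_result.shape == (3, 4, 5)
# ...but only a value check reveals the reversed ordering.
assert reversed_result[0, 0, 0].compute() == 11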

@lbdreyer lbdreyer force-pushed the multi_dim_stack_dask branch from 815fe26 to 34f810d Compare March 13, 2017 14:20
@lbdreyer (Member Author)

@dkillick @pp-mo I have made the final changes based on your reviews, and tests pass... Merge?

(Please squash and merge)

@pp-mo pp-mo merged commit 52d4ccf into SciTools:dask Mar 13, 2017
@QuLogic QuLogic added this to the dask milestone Mar 13, 2017
bjlittle pushed a commit to bjlittle/iris that referenced this pull request May 31, 2017
Use multidimension dask stacks in merge and fast um loading.
@QuLogic QuLogic modified the milestones: dask, v2.0 Aug 2, 2017
@lbdreyer lbdreyer deleted the multi_dim_stack_dask branch July 23, 2018 10:48